Module 11: Handling Scale in Serverless Applications

Developing Serverless Solutions on AWS

Scaling considerations for serverless
API Gateway throttling & caching
Lambda concurrency scaling
How different event sources scale
Kinesis scaling (shards, parallelization, fan-out)
Build, measure, learn, repeat

Thinking Serverless at Scale

Analogy: Highway system. Each service has a lane capacity (quota). If one lane jams (throttle), traffic backs up. You need to know each road's speed limit and plan alternate routes.

Know the quotas for every service you use
Focus on trade-offs between services (speed vs cost vs reliability)
Load test with production-like traffic patterns
Monitor in production and tune continuously
Stay current with service updates (quotas and scaling limits change over time)

API Gateway - Managing Scale

Feature	What It Does	Analogy
Account Quota	10,000 requests/sec across all APIs (default)	Highway speed limit for your region
Burst Capacity	5,000 requests immediate burst (token bucket)	Passing lane - short bursts allowed
Stage Throttling	Per-stage rate/burst limits	Speed limit per road segment
Route/Method	Per-route throttle (HTTP) or per-method (REST)	Speed bumps on specific streets
Usage Plans	Per-client throttle + monthly quota (API keys)	Toll pass with monthly limit per driver
Response Cache	REST APIs: cache responses to reduce backend calls	Saved answer - don't re-ask the same question

Throttling is evaluated from most specific to least specific: Client (usage plan) > Route/Method > Stage > Account
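
The burst and rate limits in the table follow token-bucket behavior. A minimal sketch of that idea, assuming small made-up numbers (10 tokens added per tick, bucket capacity 20) and a coarse one-tick-at-a-time model. The function name and the traffic pattern are hypothetical, and this is not how API Gateway is implemented internally:

```shell
# Simplified token bucket: illustrative model only.
# Usage: token_bucket RATE BURST ARRIVALS_PER_TICK...
#   RATE  = tokens added after each tick (steady-state limit)
#   BURST = bucket capacity (largest spike served at once)
# Prints "tick allowed throttled" for each tick.
token_bucket() {
  local rate=$1 burst=$2
  shift 2
  local tokens=$burst tick=0 arrivals allowed
  for arrivals in "$@"; do
    tick=$((tick + 1))
    allowed=$arrivals
    if [ "$allowed" -gt "$tokens" ]; then
      allowed=$tokens            # not enough tokens: the remainder is throttled
    fi
    echo "$tick $allowed $((arrivals - allowed))"
    tokens=$((tokens - allowed + rate))
    if [ "$tokens" -gt "$burst" ]; then
      tokens=$burst              # the bucket never holds more than the burst size
    fi
  done
}

# Hypothetical traffic: a quiet tick, a spike of 35, then steady load of 10.
token_bucket 10 20 5 35 10 10
```

In this run the spike of 35 gets 20 requests through and 15 throttled, after which steady traffic at the refill rate passes untouched.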

Lambda Concurrency Scaling

Provisioned concurrency = pre-booked seats on a plane (initialized and waiting for you). On-demand = standby (a seat is prepared when you show up, so you might wait on a cold start). Throttled = flight is full, come back later.

Function Duration Impacts Concurrency & Cost

Duration	10 req/sec needs	100 req/sec needs	Insight
100ms	1 concurrent	10 concurrent	Low concurrency, low cost
1 sec	10 concurrent	100 concurrent	Moderate
10 sec	100 concurrent	1,000 concurrent	Hits default quota!
60 sec	600 concurrent	6,000 concurrent	Way over default limit

Concurrency = seats in a restaurant. If each customer stays 10 minutes, you need 10 seats for 1 customer/minute. If they stay 60 minutes, you need 60 seats for the same rate. Shorter functions = fewer seats needed = cheaper.
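
The table rows can be reproduced with a few lines of shell arithmetic. A minimal sketch, assuming whole-number request rates and durations in milliseconds; the function name is hypothetical:

```shell
# required_concurrency REQUESTS_PER_SEC DURATION_MS
# Concurrency = requests per second x average duration in seconds,
# rounded up because a partially busy environment still counts as one.
required_concurrency() {
  local rps=$1 duration_ms=$2
  echo $(( (rps * duration_ms + 999) / 1000 ))
}

for ms in 100 1000 10000 60000; do
  echo "${ms} ms: $(required_concurrency 10 "$ms") concurrent at 10 req/sec," \
       "$(required_concurrency 100 "$ms") concurrent at 100 req/sec"
done
```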

Scaling with Sync & Async Sources

Sync = drive-through (you wait in line, feel every delay). Async = online order (submit and go, they process when ready).

Scaling with SQS Event Source

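Lambda starts by reading up to 5 batches concurrently
Adds up to 300 concurrent invocations per minute as queue depth grows (60 per minute before the late-2023 scaling update)
Scales up to 1,250 concurrent invocations (1,000 before that update), or less if reserved concurrency or the account limit is lower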
Scales DOWN if error rate is too high (backoff protection)
Maximum Concurrency setting caps scaling per event source mapping

Tuning for Scale

Setting	Impact on Scale
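Batch size (1-10,000)	Larger batch = fewer invocations needed (FIFO queues max out at 10; sizes above 10 need a batch window of at least 1 sec)
Batch window (0-5min)	Wait to fill batch = fewer invocations, higher latency
Visibility timeout	Set to at least 6x the function timeout so in-flight messages are not redelivered and processed twice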
Max concurrency	Cap to protect downstream services
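
The settings above can be applied from the CLI. A minimal sketch, assuming a 30-second function timeout and a 5-second batch window. The queue URL, mapping UUID, batch size, and concurrency cap are placeholders, and the helper names are hypothetical. Adding the batch window on top of the 6x figure follows AWS guidance for event source mappings:

```shell
# Placeholder values for illustration.
FUNCTION_TIMEOUT_SECONDS=30
BATCH_WINDOW_SECONDS=5

# visibility_timeout FUNCTION_TIMEOUT BATCH_WINDOW
# Six times the function timeout, plus the batch window.
visibility_timeout() {
  echo $(( 6 * $1 + $2 ))
}

# apply_sqs_tuning QUEUE_URL MAPPING_UUID  (needs AWS credentials)
apply_sqs_tuning() {
  local queue_url=$1 mapping_uuid=$2
  # Keep in-flight messages hidden long enough for retries to finish.
  aws sqs set-queue-attributes \
    --queue-url "$queue_url" \
    --attributes "VisibilityTimeout=$(visibility_timeout "$FUNCTION_TIMEOUT_SECONDS" "$BATCH_WINDOW_SECONDS")"
  # Larger batches, a short batching window, and a cap to protect downstream services.
  aws lambda update-event-source-mapping \
    --uuid "$mapping_uuid" \
    --batch-size 100 \
    --maximum-batching-window-in-seconds "$BATCH_WINDOW_SECONDS" \
    --scaling-config MaximumConcurrency=50
}

# Example: apply_sqs_tuning "$QUEUE_URL" "$MAPPING_UUID"
```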

Scaling with Kinesis Data Streams

Enhanced Fan-Out for Multiple Consumers

Feature	Standard Consumer	Enhanced Fan-Out
Throughput	2 MB/sec per shard SHARED across all consumers	2 MB/sec per shard PER consumer (dedicated)
Delivery	Pull (poll-based)	Push (SubscribeToShard)
Latency	~200ms avg	~70ms avg
Best for	1-2 consumers, cost-sensitive	3+ consumers, low latency critical

Standard = shared TV antenna (split signal gets weaker per viewer). Enhanced fan-out = dedicated cable line per household (full bandwidth each).
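
The throughput row is easy to check with arithmetic. A minimal sketch, assuming consumers share standard read capacity evenly; the function name and the 4-shard, 4-consumer example are hypothetical:

```shell
# per_consumer_read_mb SHARDS CONSUMERS MODE
# MODE is "standard" (2 MB/sec per shard shared by all consumers)
# or "efo" (2 MB/sec per shard dedicated to each consumer).
per_consumer_read_mb() {
  awk -v shards="$1" -v consumers="$2" -v mode="$3" 'BEGIN {
    total = 2 * shards
    if (mode == "standard") total = total / consumers
    printf "%.2f\n", total
  }'
}

per_consumer_read_mb 4 4 standard   # prints 2.00
per_consumer_read_mb 4 4 efo        # prints 8.00
```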

Metrics That Indicate Scaling Issues

Service	Metric	What It Means
Lambda	Throttles	Hitting a concurrency limit - request a quota increase or raise the function's reserved concurrency
Lambda	Duration (p99)	Approaching timeout - optimize or increase memory
Lambda	ConcurrentExecutions	Near quota - time to request increase
SQS	ApproximateAgeOfOldestMessage	Growing = processing can't keep up
SQS	ApproximateNumberOfMessagesVisible	Queue depth growing = add concurrency
Kinesis	IteratorAge (Lambda) / GetRecords.IteratorAgeMilliseconds (stream)	Growing = consumer falling behind producer
Kinesis	ReadProvisionedThroughputExceeded	Need more shards or enhanced fan-out
API GW	4XXError / 5XXError	4XX spike can mean clients are being throttled (429); 5XX means the backend or integration is failing
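
Any metric in the table can drive a CloudWatch alarm. A minimal sketch for Lambda throttles, assuming the function name my-api-handler. The alarm name and threshold are placeholders, not recommendations, and no notification action is attached:

```shell
# Alarm when the function records any throttle within a one-minute period.
# Needs AWS credentials; names and threshold are illustrative.
aws cloudwatch put-metric-alarm \
  --alarm-name my-api-handler-throttles \
  --namespace AWS/Lambda \
  --metric-name Throttles \
  --dimensions Name=FunctionName,Value=my-api-handler \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching
```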

What's New (2024-2025)

Concurrency Scaling Rate - Each function can add up to 1,000 concurrent executions every 10 seconds, independent of other functions (replaces the old shared regional burst model)
Maximum Concurrency for SQS - Cap Lambda scaling per event source mapping
Provisioned Concurrency Auto-Scaling - Application Auto Scaling adjusts PC based on utilization
Enhanced Fan-Out GA - Dedicated throughput per consumer at scale
HTTP API Route-Level Throttling - Granular rate limits per route
SnapStart - Reduces cold start latency by restoring from a pre-initialized snapshot (Java/Python/.NET)
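
Provisioned concurrency auto-scaling is configured through Application Auto Scaling. A minimal sketch, assuming a function my-api-handler with an alias prod. The policy name, capacity bounds, and the 70% utilization target are placeholders:

```shell
# Needs AWS credentials. Function, alias, policy name, and bounds are illustrative.
# Register the alias's provisioned concurrency as a scalable target.
aws application-autoscaling register-scalable-target \
  --service-namespace lambda \
  --resource-id function:my-api-handler:prod \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --min-capacity 1 \
  --max-capacity 10

# Target tracking: keep provisioned concurrency utilization near 70%.
aws application-autoscaling put-scaling-policy \
  --service-namespace lambda \
  --resource-id function:my-api-handler:prod \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --policy-name pc-utilization-70 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration \
    '{"TargetValue": 0.7, "PredefinedMetricSpecification": {"PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"}}'
```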

Q1: What is the formula for Lambda concurrency?

A) Requests per second / Memory size
B) Requests per second x Average duration (seconds)
C) Number of shards x Batch size
D) Memory x Timeout

Answer: B) Requests/sec x Duration (sec)
10 req/sec with 1s duration = 10 concurrent. Shorter functions = less concurrency needed = less cost.
A: Memory doesn't affect concurrency count. C: That's Kinesis-specific. D: Those are config, not concurrency formula.

Q2: A Lambda function with SQS source is hitting throttles. What should you do FIRST?

A) Increase memory to 10GB
B) Set Maximum Concurrency on the event source mapping
C) Request a concurrency quota increase
D) Delete the DLQ

Answer: C) Request a concurrency quota increase
Throttling means the function has hit a concurrency ceiling. Assuming no reserved concurrency is set on the function, that ceiling is the regional account limit (default 1,000). Request an increase via Service Quotas.
A: Memory doesn't affect concurrency quota. B: Max Concurrency LIMITS scaling (opposite of what you want). D: DLQ handles failures, unrelated to throttle.

Q3: How do you increase Kinesis stream processing throughput?

A) Increase Lambda memory
B) Add more shards and/or increase parallelization factor
C) Increase API Gateway timeout
D) Add more SQS queues

Answer: B) Add shards + increase parallelization factor
More shards = more ingest capacity + more concurrent Lambda instances. Parallelization (1-10) multiplies concurrency per shard.
A: Memory helps speed per invocation, not throughput at stream level. C: API GW isn't involved with Kinesis. D: SQS is a different source.
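
The relationship in this answer is a simple product. A minimal sketch, assuming one batch per shard at factor 1; the function name is hypothetical and the mapping UUID in the commented command is a placeholder:

```shell
# kinesis_max_concurrency SHARDS PARALLELIZATION_FACTOR
# Upper bound on concurrent Lambda invocations for one event source mapping.
kinesis_max_concurrency() {
  local shards=$1 factor=$2
  if [ "$factor" -lt 1 ] || [ "$factor" -gt 10 ]; then
    echo "parallelization factor must be between 1 and 10" >&2
    return 1
  fi
  echo $(( shards * factor ))
}

kinesis_max_concurrency 4 1    # prints 4
kinesis_max_concurrency 4 10   # prints 40

# Raising the factor on an existing mapping (needs AWS credentials):
# aws lambda update-event-source-mapping --uuid "$MAPPING_UUID" --parallelization-factor 10
```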

Q4: Why does async invocation reduce throttle impact on clients?

A) Async has no concurrency limit
B) Client gets 202 immediately; Lambda retries internally for up to 6 hours
C) Async functions run faster
D) Async bypasses IAM checks

Answer: B) Client gets 202; Lambda handles retries internally
The client doesn't wait or see the throttle. Lambda queues the event and retries for up to 6 hours until concurrency is available.
A: Same concurrency limits apply. C: Same function, same speed. D: IAM always applies.
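
The 202 behavior can be seen from the CLI. A minimal sketch, assuming AWS CLI v2 and the function name my-api-handler. The payload and the shortened retry settings are placeholders:

```shell
# Needs AWS credentials. --invocation-type Event queues the event;
# the CLI reports "StatusCode": 202 without waiting for the function to run.
aws lambda invoke \
  --function-name my-api-handler \
  --invocation-type Event \
  --cli-binary-format raw-in-base64-out \
  --payload '{"orderId": "demo-1"}' \
  response.json

# Optional: shorten how long Lambda keeps an event queued
# (the default maximum event age is 6 hours = 21600 seconds).
aws lambda put-function-event-invoke-config \
  --function-name my-api-handler \
  --maximum-event-age-in-seconds 3600 \
  --maximum-retry-attempts 2
```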

Live Demo: Load Testing & Observing Scale

Step 1: Deploy a function with reserved concurrency

aws lambda put-function-concurrency \
  --function-name my-api-handler \
  --reserved-concurrent-executions 10

Step 2: Generate load (exceed the limit)

# Install artillery for load testing
npm install -g artillery

# Create load test (20 virtual users, each sending 30 requests = 600 requests total)
artillery quick --count 20 --num 30 \
  https://API_ID.execute-api.us-west-2.amazonaws.com/prod/items

Step 3: Observe in CloudWatch

# Check throttle counts for the last 5 minutes (CloudWatch metrics lag by about a minute)
# Note: `date -d` is GNU date syntax; on macOS use `date -u -v-5M` for the start time
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name Throttles \
  --dimensions Name=FunctionName,Value=my-api-handler \
  --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Sum

Demo: What to Show

Action	What Students See
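Set reserved=10, send 20 req/sec	Throttles appear; Lambda rejects excess invocations with 429 (TooManyRequestsException), which clients calling through API Gateway typically see as a 5XX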
Increase reserved to 50	Throttles disappear, all requests succeed
Remove reserved, show account pool	Function draws on the shared account pool (default limit 1,000)
Add provisioned=10	First 10 concurrent requests skip the cold start; additional requests run on-demand and may cold start
Show CloudWatch metrics	ConcurrentExecutions, Throttles, Duration graphs
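
For the "Add provisioned=10" step, a minimal sketch, assuming an alias named prod already exists on my-api-handler (the same qualifier the cleanup commands use). Provisioned concurrency only applies to requests routed to that alias:

```shell
# Needs AWS credentials. Allocate 10 pre-initialized environments on the alias.
aws lambda put-provisioned-concurrency-config \
  --function-name my-api-handler \
  --qualifier prod \
  --provisioned-concurrent-executions 10

# Re-run until Status changes from IN_PROGRESS to READY.
aws lambda get-provisioned-concurrency-config \
  --function-name my-api-handler \
  --qualifier prod \
  --query Status --output text
```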

Cleanup

aws lambda delete-function-concurrency --function-name my-api-handler
aws lambda delete-provisioned-concurrency-config \
  --function-name my-api-handler --qualifier prod

Module Summary

Consider quotas and trade-offs when choosing services
Apply API Gateway throttling to manage incoming traffic
Use reserved + provisioned concurrency to manage Lambda scaling
Account for event source scaling behavior (sync vs async vs polling)
Concurrency = requests/sec x average duration in seconds (shorter = cheaper + more headroom)
Kinesis: shards x parallelization factor = maximum concurrent invocations
Monitor: Throttles, IteratorAge, SQS queue depth (ApproximateNumberOfMessagesVisible), Duration p99
Build, measure, learn, repeat!